<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
	<title>混合语言域 | Elasticsearch: 权威指南 | Elastic</title>
    <!-- Give IE8 a fighting chance -->
    <!--[if lt IE 9]>
    <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
    <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
	<link rel="stylesheet" type="text/css" href="../static/styles.css" />
</head>
<body>
<div class="main-container">
    <section id="content">
        
        <div class="content-wrapper">
            <section id="guide" lang="zh_cn">
                <div class="container">
                    <div class="row">
                        <div class="col-xs-12 col-sm-8 col-md-8 guide-section">
                            <div style="color:gray; word-break: break-all; font-size:12px;">原文地址: <a href="https://www.elastic.co/guide/cn/elasticsearch/guide/current/mixed-lang-fields.html" rel="nofollow">https://www.elastic.co/guide/cn/elasticsearch/guide/current/mixed-lang-fields.html</a>, 版权归 www.elastic.co 所有<br/>
                            英文版地址: <a href="https://www.elastic.co/guide/en/elasticsearch/guide/current/mixed-lang-fields.html" rel="nofollow">https://www.elastic.co/guide/en/elasticsearch/guide/current/mixed-lang-fields.html</a>
                            </div>
                        <!-- start body -->
                  <div class="page_header">
<b>请注意:</b><br>本书基于 Elasticsearch 2.x 版本，有些内容可能已经过时。
</div>
<div id="content">
<div class="breadcrumbs">
<span class="breadcrumb-link"><a href="index.html">Elasticsearch: 权威指南</a></span>
»
<span class="breadcrumb-link"><a href="languages.html">处理人类语言</a></span>
»
<span class="breadcrumb-link"><a href="language-intro.html">开始处理各种语言</a></span>
»
<span class="breadcrumb-node">混合语言域</span>
</div>
<div class="navheader">
<span class="prev">
<a href="one-lang-fields.html">« 每个域一种语言</a>
</span>
<span class="next">
<a href="identifying-words.html">词汇识别 »</a>
</span>
</div>
<div class="section">
<div class="titlepage"><div><div>
<h2 class="title">
<a id="mixed-lang-fields"></a>混合语言域<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/200_Language_intro/60_Mixed_language_fields.asciidoc">edit</a>
</h2>
</div></div></div>
<p>通常,那些从源数据中获得的多种语言混合在一个域中的文档会超出你的控制，
例如从网上爬取的页面：</p>
<div class="pre_wrapper lang-js">
<pre class="programlisting prettyprint lang-js">{ "body": "Page not found / Seite nicht gefunden / Page non trouvée" }</pre>
</div>
<p>正确的处理多语言类型文档是非常困难的。即使你简单对所有的域使用 <code class="literal">standard</code> （标准）分析器，
但你的文档会变得不利于搜索，除非你使用了合适的词干提取器。当然，你不可能只选择一个词干提取器。
词干提取器是由语言具体决定的。或者，词干提取器是由语言和脚本所具体决定的。像在 <a class="xref" href="language-pitfalls.html#different-scripts" title="每种书写方式一种词干提取器">每种书写方式一种词干提取器</a> 讨论中那样。
如果每个语言都使用不同的脚本，那么词干提取器就可以合并了。</p>
<p>假设你的混合语言使用的是一样的脚本，例如拉丁文，你有三个可用的选择：</p>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
切分到不同的域
</li>
<li class="listitem">
进行多次分析
</li>
<li class="listitem">
使用 n-grams
</li>
</ul>
</div>
<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="_切分到不同的域"></a>切分到不同的域<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/200_Language_intro/60_Mixed_language_fields.asciidoc">edit</a>
</h3>
</div></div></div>
<p>在 <a class="xref" href="language-pitfalls.html#identifying-language" title="语言识别">语言识别</a> 提到过的紧凑的语言检测可以告诉你哪部分文档属于哪种语言。
你可以用 <a class="xref" href="one-lang-fields.html" title="每个域一种语言">每个域一种语言</a> 中用过的一样的方法来根据语言切分文本。</p>
</div>

<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="_进行多次分析"></a>进行多次分析<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/200_Language_intro/60_Mixed_language_fields.asciidoc">edit</a>
</h3>
</div></div></div>
<p>如果你主要处理数量有限的语言，
你可以使用多个域，每种语言都分析文本一次。</p>
<div class="pre_wrapper lang-js">
<pre class="programlisting prettyprint lang-js">PUT /movies
{
  "mappings": {
    "title": {
      "properties": {
        "title": { <a id="CO133-1"></a><i class="conum" data-value="1"></i>
          "type": "string",
          "fields": {
            "de": { <a id="CO133-2"></a><i class="conum" data-value="2"></i>
              "type":     "string",
              "analyzer": "german"
            },
            "en": { <a id="CO133-3"></a><i class="conum" data-value="2"></i>
              "type":     "string",
              "analyzer": "english"
            },
            "fr": { <a id="CO133-4"></a><i class="conum" data-value="2"></i>
              "type":     "string",
              "analyzer": "french"
            },
            "es": { <a id="CO133-5"></a><i class="conum" data-value="2"></i>
              "type":     "string",
              "analyzer": "spanish"
            }
          }
        }
      }
    }
  }
}</pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO133-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>主域 <code class="literal">title</code> 使用 <code class="literal">standard</code> （标准）分析器</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO133-2"><i class="conum" data-value="2"></i></a><a href="#CO133-3"></a><a href="#CO133-4"></a><a href="#CO133-5"></a></p>
</td>
<td align="left" valign="top">
<p>每个子域提供不同的语言分析器来对  <code class="literal">title</code> 域文本进行分析。</p>
</td>
</tr>
</table>
</div>
</div>

<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="_使用_n_grams"></a>使用 n-grams<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/200_Language_intro/60_Mixed_language_fields.asciidoc">edit</a>
</h3>
</div></div></div>
<p>你可以使用 <a class="xref" href="ngrams-compound-words.html" title="Ngrams 在复合词的应用">Ngrams 在复合词的应用</a> 中描述的方法索引所有的词汇为 n-grams。
大多数语型变化包含给单词添加一个后缀（或在一些语言中添加前缀），所以通过将单词拆成 n-grams，你有很大的机会匹配到相似但不完全一样的单词。
这个可以结合 <em>analyze-multiple times</em> （多次分析）方法为不支持的语言提供全域抓取：</p>
<div class="pre_wrapper lang-js">
<pre class="programlisting prettyprint lang-js">PUT /movies
{
  "settings": {
    "analysis": {...} <a id="CO134-1"></a><i class="conum" data-value="1"></i>
  },
  "mappings": {
    "title": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "de": {
              "type":     "string",
              "analyzer": "german"
            },
            "en": {
              "type":     "string",
              "analyzer": "english"
            },
            "fr": {
              "type":     "string",
              "analyzer": "french"
            },
            "es": {
              "type":     "string",
              "analyzer": "spanish"
            },
            "general": { <a id="CO134-2"></a><i class="conum" data-value="2"></i>
              "type":     "string",
              "analyzer": "trigrams"
            }
          }
        }
      }
    }
  }
}</pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO134-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>在 <code class="literal">analysis</code> 章节, 我们按照 <a class="xref" href="ngrams-compound-words.html" title="Ngrams 在复合词的应用">Ngrams 在复合词的应用</a> 中描述的定义了同样的 <code class="literal">trigrams</code> 分析器。</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO134-2"><i class="conum" data-value="2"></i></a></p>
</td>
<td align="left" valign="top">
<p>在 <code class="literal">title.general</code> 域使用 <code class="literal">trigrams</code> 分析器索引所有的语言。</p>
</td>
</tr>
</table>
</div>
<p>当查询抓取所有 <code class="literal">general</code> 域时，你可以使用 <code class="literal">minimum_should_match</code> （最少应当匹配数）来减少低质量的匹配。
或许也需要对其他字段进行稍微的加权，给与主语言域的权重要高于其他的在 <code class="literal">general</code> 上的域：</p>
<div class="pre_wrapper lang-js">
<pre class="programlisting prettyprint lang-js">GET /movies/movie/_search
{
    "query": {
        "multi_match": {
            "query":    "club de la lucha",
            "fields": [ "title*^1.5", "title.general" ], <a id="CO135-1"></a><i class="conum" data-value="1"></i>
            "type":     "most_fields",
            "minimum_should_match": "75%" <a id="CO135-2"></a><i class="conum" data-value="2"></i>
        }
    }
}</pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO135-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>所有 <code class="literal">title</code> 或 <code class="literal">title.*</code> 域给与了比 <code class="literal">title.general</code> 域稍微高的加权。</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO135-2"><i class="conum" data-value="2"></i></a></p>
</td>
<td align="left" valign="top">
<p><code class="literal">minimum_should_match</code>（最少应当匹配数） 参数减少了低质量匹配的返回数, 这对 <code class="literal">title.general</code> 域尤其重要。</p>
</td>
</tr>
</table>
</div>
</div>

</div>
<div class="navfooter">
<span class="prev">
<a href="one-lang-fields.html">« 每个域一种语言</a>
</span>
<span class="next">
<a href="identifying-words.html">词汇识别 »</a>
</span>
</div>
</div>

                  <!-- end body -->
                        </div>
                        <div class="col-xs-12 col-sm-4 col-md-4" id="right_col">
                        
                        </div>
                    </div>
                </div>
            </section>
        </div>
    </section>
</div>
<script src="../static/cn.js"></script>
</body>
</html>